
    Dynamic Loop Scheduling Using MPI Passive-Target Remote Memory Access

    Scientific applications often contain large, computationally-intensive parallel loops. Loop scheduling techniques aim to achieve load-balanced executions of such applications. For distributed-memory systems, existing dynamic loop scheduling (DLS) libraries are typically MPI-based and employ a master-worker execution model to assign variably-sized chunks of loop iterations. The master-worker execution model may adversely impact performance due to contention at the master. This work proposes a distributed chunk-calculation approach that does not require the master-worker execution scheme. Moreover, it exploits novel features of the latest MPI standards, such as passive-target remote memory access, shared-memory window creation, and atomic read-modify-write operations. To evaluate the proposed approach, five well-known DLS techniques, two applications, and two heterogeneous hardware setups are considered. The DLS techniques implemented using the proposed approach outperformed their counterparts implemented using the traditional master-worker execution model.
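    The core mechanism can be illustrated with a minimal, hypothetical sketch (not the authors' library): each worker obtains its next chunk by an atomic read-modify-write on a shared iteration counter exposed through a passive-target MPI window. The counter placement on rank 0 and the fixed chunk size are illustrative assumptions.

        /* Sketch: distributed self-scheduling without a master process.
         * Each rank atomically grabs the next chunk start via passive-target
         * RMA; counter placement and chunk policy are assumptions. */
        #include <mpi.h>

        #define N     1000000L  /* total loop iterations (assumed) */
        #define CHUNK 1000L     /* fixed chunk size, for simplicity */

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            long *counter = NULL;
            MPI_Win win;
            /* Rank 0 hosts the shared iteration counter; the others expose
             * a zero-sized window. */
            MPI_Win_allocate(rank == 0 ? sizeof(long) : 0, sizeof(long),
                             MPI_INFO_NULL, MPI_COMM_WORLD, &counter, &win);
            if (rank == 0) *counter = 0;
            MPI_Barrier(MPI_COMM_WORLD);

            /* Passive-target epoch: no process acts as a master. */
            MPI_Win_lock_all(0, win);
            const long inc = CHUNK;
            long start;
            for (;;) {
                /* Atomic read-modify-write: fetch the current start and
                 * advance the counter by CHUNK in one operation. */
                MPI_Fetch_and_op(&inc, &start, MPI_LONG, 0, 0, MPI_SUM, win);
                MPI_Win_flush(0, win);
                if (start >= N) break;
                long end = start + CHUNK < N ? start + CHUNK : N;
                for (long i = start; i < end; ++i) {
                    /* ... execute loop iteration i ... */
                }
            }
            MPI_Win_unlock_all(win);
            MPI_Win_free(&win);
            MPI_Finalize();
            return 0;
        }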

    Hierarchical Dynamic Loop Self-Scheduling on Distributed-Memory Systems Using an MPI+MPI Approach

    Computationally-intensive loops are the primary source of parallelism in scientific applications. Such loops are often irregular, and a balanced execution of their loop iterations is critical for achieving high performance. However, several factors may lead to imbalanced executions, such as problem, algorithmic, and systemic variations. Dynamic loop self-scheduling (DLS) techniques are devised to mitigate these factors and, consequently, improve application performance. On distributed-memory systems, DLS techniques can be implemented using a hierarchical master-worker execution model and are, therefore, called hierarchical DLS techniques. These techniques self-schedule loop iterations at two levels of hardware parallelism: across and within compute nodes. Hybrid programming approaches that combine the message passing interface (MPI) with open multi-processing (OpenMP) dominate the implementation of hierarchical DLS techniques. The MPI-3 standard includes a feature for sharing memory regions among MPI processes. This feature introduced the MPI+MPI approach, which simplifies the implementation of parallel scientific applications. The present work designs and implements hierarchical DLS techniques by exploiting the MPI+MPI approach. Four well-known DLS techniques are considered in the evaluation proposed herein. The results indicate certain performance advantages of the proposed approach compared to the hybrid MPI+OpenMP approach.
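    The MPI-3 shared-memory feature at the heart of the MPI+MPI approach can be sketched as follows. This is a minimal, hypothetical example, not the authors' implementation; the node-local counter and its placement on node rank 0 are assumptions.

        /* Sketch of the MPI+MPI shared-memory mechanism: ranks on the same
         * node share a window and access it by direct load/store, without
         * OpenMP. */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int world_rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

            /* Split MPI_COMM_WORLD into per-node communicators. */
            MPI_Comm node_comm;
            MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                                MPI_INFO_NULL, &node_comm);
            int node_rank;
            MPI_Comm_rank(node_comm, &node_rank);

            /* Node-local shared counter: only node rank 0 allocates it. */
            long *base = NULL;
            MPI_Win win;
            MPI_Win_allocate_shared(node_rank == 0 ? sizeof(long) : 0,
                                    sizeof(long), MPI_INFO_NULL, node_comm,
                                    &base, &win);

            /* Other ranks query a direct pointer to rank 0's segment. */
            if (node_rank != 0) {
                MPI_Aint seg_size;
                int disp_unit;
                MPI_Win_shared_query(win, 0, &seg_size, &disp_unit, &base);
            }
            if (node_rank == 0) *base = 0;
            MPI_Barrier(node_comm);

            /* All ranks on the node now read/write *base directly. */
            printf("world rank %d sees node counter %ld\n", world_rank, *base);

            MPI_Win_free(&win);
            MPI_Comm_free(&node_comm);
            MPI_Finalize();
            return 0;
        }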

    Efficient Generation of Parallel Spin-images Using Dynamic Loop Scheduling

    High performance computing (HPC) systems have undergone a significant increase in their processing capabilities. Modern HPC systems combine large numbers of homogeneous and heterogeneous computing resources. Scalability is, therefore, an essential aspect of scientific applications for efficiently exploiting the massive parallelism of modern HPC systems. This work introduces an efficient version of the parallel spin-image algorithm (PSIA), called EPSIA. The PSIA is a parallel version of the spin-image algorithm (SIA). The (P)SIA is used in various domains, such as 3D object recognition, categorization, and 3D face recognition. EPSIA refers to the extended version of the PSIA that integrates various well-known dynamic loop scheduling (DLS) techniques. The present work: (1) proposes EPSIA, a novel flexible version of the PSIA; (2) showcases the benefits of applying DLS techniques for optimizing the performance of the PSIA; (3) assesses the performance of the proposed EPSIA by conducting several scalability experiments. The performance results are promising and show that, using well-known DLS techniques, the EPSIA outperforms the PSIA by factors of 1.2 and 2 for homogeneous and heterogeneous computing resources, respectively.

    Performance Reproduction and Prediction of Selected Dynamic Loop Scheduling Experiments

    Scientific applications are complex, large, and often exhibit irregular and stochastic behavior. The use of efficient loop scheduling techniques in computationally-intensive applications is crucial for improving their performance on high-performance computing (HPC) platforms. A number of dynamic loop scheduling (DLS) techniques were proposed between the late 1980s and early 2000s and have been used efficiently in scientific applications. In most cases, the computing systems on which they were tested and validated are no longer available. This work is concerned with the minimization of the sources of uncertainty in the implementation of DLS techniques to avoid unnecessary influences on the performance of scientific applications. It is, therefore, important to ensure that the DLS techniques employed in scientific applications today adhere to their original design goals and specifications. The goal of this work is to attain and increase the trust in the implementation of DLS techniques in present studies. To achieve this goal, the performance of a selection of scheduling experiments from the 1992 original work that introduced factoring is reproduced and predicted via both simulative and native experimentation. The experiments show that the simulation reproduces the performance achieved on the past computing platform and accurately predicts the performance achieved on the present computing platform. The performance reproduction and prediction confirm that the present implementation of the DLS techniques considered, both in simulation and natively, adheres to their original description. The results also confirm the hypothesis that reproducing experiments of identical scheduling scenarios on past and modern hardware leads to behavior entirely different from the one expected.
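    For context, the practical chunk-calculation rule of factoring from that 1992 work can be stated as follows, using the commonly cited batching factor of 2; the notation below is ours, with P workers and R_j iterations remaining before batch j.

        % Factoring: each batch hands out P chunks of equal size K_j.
        K_j = \left\lceil \frac{R_j}{2P} \right\rceil, \qquad
        R_0 = N, \qquad
        R_{j+1} = R_j - P\,K_j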

    A Methodology for Bridging the Native and Simulated Execution of Parallel Applications

    Simulation is considered the third pillar of science, following experimentation and theory. Bridging the native and simulated executions of parallel applications is needed for attaining trustworthiness in simulation results. Yet, bridging the two is challenging. This work proposes a methodology for bridging the native and simulated executions of message-passing parallel applications on high performance computing (HPC) systems in two steps: expression of the software characteristics, and representation and verification of the hardware characteristics in the simulation. This work exploits the capabilities of the SimGrid [3] simulation toolkit's interfaces to reduce the effort of bridging the native and simulated executions of a parallel application on an HPC system. For an application from computer vision, the simulation of its parallel execution using straightforward parallelization on an HPC cluster approaches the native performance with a minimum relative percentage difference of 5.6%.
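    One concrete way to bridge the two executions, which SimGrid supports through its SMPI interface, is to run the same unmodified MPI source both natively and in simulation and compare the timings. The sketch below is ours, not the paper's methodology; the platform and hostfile names are placeholders.

        /* The same MPI source can be run natively (mpicc/mpirun) and in
         * simulation via SimGrid's SMPI interface (smpicc/smpirun):
         *
         *   native:    mpicc  timing.c -o timing && mpirun -np 4 ./timing
         *   simulated: smpicc timing.c -o timing && smpirun -np 4 \
         *              -platform cluster.xml -hostfile hosts.txt ./timing
         */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            double t0 = MPI_Wtime();
            MPI_Barrier(MPI_COMM_WORLD);  /* timed natively or simulated */
            double t1 = MPI_Wtime();
            if (rank == 0)
                printf("barrier took %f s\n", t1 - t0);
            MPI_Finalize();
            return 0;
        }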

    Exploring the Relation between Two Levels of Scheduling Using a Novel Simulation Approach

    Modern high performance computing (HPC) systems exhibit rapid growth in size, both “horizontally”, in the number of nodes, and “vertically”, in the number of cores per node. As such, they offer additional levels of hardware parallelism. Each level requires and employs algorithms for appropriately scheduling the computational work at the respective level. The present work explores the relation between two scheduling levels: batch and application. To understand and explore this relation, a novel simulation approach is presented that bridges two existing simulators from the two scheduling levels. A novel two-level simulator that implements the proposed approach is introduced. The two-level simulator is used to simulate all combinations of three batch scheduling and four application scheduling algorithms from the literature. These combinations are considered for allocating resources and executing the parallel jobs from the workload of a production HPC system. The results of the scheduling experiments reveal the strong relation between decisions taken at the two scheduling levels and their mutual influence. Complementing the simulations, the two-level simulator produces abstract parallel execution traces, which can be visually examined and illustrate the execution of the different jobs and, for each job, the execution of its tasks at the node and core levels, respectively.
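    The interplay between the two levels can be pictured with a toy serial sketch (ours, not the simulator): a first-come-first-served batch level grants nodes to each job, and an application level then dispatches the job's iterations to those nodes in fixed-size chunks. All job sizes, policies, and the chunk size are illustrative assumptions.

        #include <stdio.h>

        int main(void) {
            const int  total_nodes = 4;
            const int  job_nodes[] = {2, 1, 4};    /* nodes per job */
            const long job_iters[] = {40, 20, 60}; /* iterations per job */
            const long chunk = 10;

            for (int j = 0; j < 3; ++j) {
                /* Batch level (FCFS): in this toy, jobs run one after
                 * another, so the requested nodes are always available. */
                int p = job_nodes[j] <= total_nodes ? job_nodes[j]
                                                    : total_nodes;
                printf("batch level: job %d granted %d node(s)\n", j, p);

                /* Application level: dispatch chunks to the job's nodes. */
                long c = 0;
                for (long next = 0; next < job_iters[j]; next += chunk, ++c) {
                    long end = next + chunk < job_iters[j] ? next + chunk
                                                           : job_iters[j];
                    printf("  app level: iterations [%ld, %ld) -> node %ld\n",
                           next, end, c % p);
                }
            }
            return 0;
        }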

    Simulating Batch and Application Level Scheduling Using GridSim and SimGrid

    Modern high performance computing (HPC) systems are increasing in the complexity of their design and in the levels of parallelism they offer. Studying and enhancing scheduling in HPC systems has become very interesting for two main reasons. First, scheduling decisions are taken by different types of schedulers, such as batch, application, process, and thread schedulers. Second, simulation has become an important tool for examining the design of HPC systems. Therefore, in this work, we study the simulation of different scheduling levels. We used two well-known simulation toolkits, SimGrid and GridSim, to support two different scheduling levels: batch and application level scheduling. Each toolkit is extended to support both levels. Moreover, three different scheduling algorithms for each level are implemented, and their performance is examined using a real workload dataset. Finally, a comparison of the extension challenges of the two simulators is conducted.

    Dynamic Loop Scheduling Using the MPI Passive-Target Remote Memory Access Model

    Large parallel loops are present in many scientific applications. Static and dynamic loop scheduling (DLS) techniques aim to achieve load-balanced executions of applications. The use of DLS techniques in scientific applications, such as the self-scheduling-based techniques, showed significant performance advantages compared to static techniques. On distributed-memory systems, DLS techniques have been implemented using the message-passing interface (MPI). Existing implementations of MPI-based DLS libraries do not consider the novel features of the latest MPI standards, such as one-sided communication, shared-memory window creation, and atomic read-modify-write operations. This poster considers these features and proposes an MPI-based DLS library written in the C language. Unlike existing libraries, the proposed DLS library does not employ a master-worker execution model. Moreover, it contains implementations of five well-known DLS techniques, namely self-scheduling, fixed-size chunking, guided self-scheduling, trapezoid self-scheduling, and factoring. An application from computer vision is used to assess and compare the performance of the proposed library against the performance of existing solutions. The evaluation results show improved performance and highlight the need to revise and upgrade existing solutions in light of the significant advancements in the MPI standards.
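    The five techniques differ mainly in how the next chunk size is computed. The sketch below follows the published rules but is ours, not the poster's library; the fixed-size-chunking constant is simplified (the Kruskal-Weiss optimum requires overhead and variance estimates), and all parameter choices are illustrative.

        #include <stdio.h>

        typedef enum { SS, FSC, GSS, TSS, FAC } dls_t;

        /* next_chunk: R = remaining iterations, N = total iterations,
         * P = number of workers, sched = chunks handed out so far (TSS). */
        long next_chunk(dls_t t, long R, long N, int P, long sched) {
            switch (t) {
            case SS:  return 1;               /* one iteration at a time */
            case FSC: return N / (4 * P);     /* placeholder constant;
                                                 the true optimum needs
                                                 overhead/variance data */
            case GSS: return (R + P - 1) / P; /* ceil(R / P) */
            case TSS: {                       /* linearly decreasing */
                long F = (N + 2*P - 1) / (2 * P);       /* first chunk */
                long L = 1;                             /* last chunk */
                long S = (2*N + F + L - 1) / (F + L);   /* ~chunk count */
                long d = S > 1 ? (F - L) / (S - 1) : 0; /* decrement */
                long k = F - sched * d;
                return k > L ? k : L;
            }
            case FAC: return (R + 2*P - 1) / (2 * P); /* ceil(R / 2P);
                                                         the full rule keeps
                                                         this size for a
                                                         batch of P chunks */
            }
            return 1;
        }

        int main(void) {
            long N = 1000, R = N, sched = 0;
            int P = 4;
            while (R > 0) {                   /* e.g., schedule with GSS */
                long k = next_chunk(GSS, R, N, P, sched++);
                if (k > R) k = R;
                printf("chunk %ld of size %ld\n", sched, k);
                R -= k;
            }
            return 0;
        }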